Abstract

In this work, we propose a PAC-Bayes bound for the generalization risk of the Gibbs classifier in the multi-class classification framework. The novelty of our work is the critical use of the confusion matrix of a classifier as an error measure; this puts our contribution in the line of work aiming at dealing with performance measures that are richer than mere scalar criteria such as the misclassification rate. Thanks to very recent and beautiful results on matrix concentration inequalities, we derive two bounds showing that the true confusion risk of the Gibbs classifier is upper-bounded by its empirical risk plus a term depending on the number of training examples in each class. To the best of our knowledge, these are the first PAC-Bayes bounds based on confusion matrices.
1 Introduction



The PAC-Bayesian framework, first introduced in McAllester (1999b), is an important field of research in learning theory. It borrows ideas from the philosophy of Bayesian inference and mixes them with techniques used in statistical approaches to learning. Given a family of classifiers $\mathcal{F}$, the ingredients of a PAC-Bayesian bound are a prior distribution $\mathcal{P}$ over $\mathcal{F}$, a learning sample $S$ and a posterior distribution $\mathcal{Q}$ over $\mathcal{F}$. Distribution $\mathcal{P}$ conveys some prior belief on what the best classifiers from $\mathcal{F}$ are (prior to any access to $S$); the classifiers expected to perform best on the classification task at hand therefore have the largest weights under $\mathcal{P}$. The posterior distribution $\mathcal{Q}$ is learned/adjusted using the information provided by the training set $S$. The essence of PAC-Bayesian results is to bound the risk of the stochastic Gibbs classifier associated with $\mathcal{Q}$ (Catoni): in order to predict the label of an example $x$, this predictor first draws a classifier $f$ from $\mathcal{F}$ according to $\mathcal{Q}$ and then returns $f(x)$.

When specialized to appropriate function spaces and relevant families of prior and posterior distributions, PAC-Bayes bounds can be used to characterize the error of different existing classification methods. An example deals with the risk of methods based upon the idea of the majority vote. We may notice that if $\mathcal{Q}$ is the posterior distribution, the error of the $\mathcal{Q}$-weighted majority vote classifier (which makes a prediction for $x$ according to $\sum_f f(x)\mathcal{Q}(f)$) is bounded by twice the error of the Gibbs classifier. If the classifiers from $\mathcal{F}$ that $\mathcal{Q}$ puts a lot of weight on are good enough, the bound on the risk of the Gibbs classifier can therefore be an informative bound for the $\mathcal{Q}$-weighted majority vote. Langford and Shawe-Taylor (2002) give a PAC-Bayes bound for Support Vector Machines (SVMs), which depends on the margin of the examples. In their study, both the prior and posterior distributions are normal distributions, with different means and variances. Empirical results show that this bound is a good estimator of the risk of SVMs (Langford, 2005).



PAC-Bayes bounds can also be used to derive new supervised learning algorithms. For example, Lacasse et al. (2007) have introduced an elegant bound on the risk of the majority vote, which holds for any space $\mathcal{F}$. This bound is used to derive an algorithm, namely MinCq, which achieves empirical results on par with state-of-the-art methods.



*This work was supported in part by the French projects VideoSense ANR-09-CORD-026 and DECODA ANR-09-CORD-005-01 of the ANR, and in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors' views.
PAC-Bayesian Generalization Bounds on the Confusion Matrix



Some other important results are given in Catoni (2007), Seeger and Seeger (2002), McAllester (1999a) and Langford et al. (2001).

In the present paper, we address the multi-class classification problem. Some related works are therefore the multi-class formulations for Support Vector Machines, such as the frameworks presented in Weston and Watkins (1998), Lee et al. (2004) and Crammer and Singer (2002). As majority vote methods, we can also cite multi-class adaptations of the boosting method called AdaBoost (Freund and Schapire, 1996), such as the framework given in Mukherjee and Schapire (2011), the AdaBoost.MH/AdaBoost.MR algorithms (Schapire and Singer, 1999) and the SAMME algorithm (Zhu et al., 2009).

The originality of our work is that we consider the confusion matrix of the Gibbs classifier as an error measure. We believe that in the multi-class framework, it is more relevant to consider the confusion matrix as the error measure than the mere misclassification error, which corresponds to the probability for some classifier $h$ to err in its prediction on $x$. The information as to what is the probability for an instance of class $p$ to be classified into class $q$ (with $p \neq q$) by some predictor is indeed crucial in some applications (think of the difference between false-negative and false-positive predictions in an automated diagnosis system). To the best of our knowledge, we are the first to propose a generalization bound on the confusion matrix in the PAC-Bayesian framework. The result that we propose heavily relies on a matrix concentration inequality for sums of random matrices introduced by Tropp (2011). One may anticipate that generalization bounds for the confusion matrix may also be obtained in frameworks other than the PAC-Bayesian one (e.g. uniform stability, online learning).

The rest of this paper is organized as follows. Section 2 introduces the setting of multi-class learning and some of the basic notation used throughout the paper. Section 3 briefly recalls the folk PAC-Bayes bound as introduced in McAllester (2003). In Section 4, we present the main contribution of this paper, our PAC-Bayes bound on the confusion matrix, followed by its proof in Section 5. We discuss some future works in Section 6.



2 Setting and Notations 

This section presents the general setting that we consider and the different tools that we will make use of. 
2.1 General Problem Setting 

We consider classification tasks over the input space $X \subseteq \mathbb{R}^d$ of dimension $d$. The output space is denoted by $Y = \{1, \dots, Q\}$, where $Q$ is the number of classes. The learning sample is denoted by $S = \{(x_i, y_i)\}_{i=1}^m$, where each example is drawn i.i.d. from a fixed but unknown probability distribution $\mathcal{D}$ defined over $X \times Y$. $\mathcal{D}_m$ denotes the distribution of an $m$-sample. $\mathcal{F} \subseteq \mathbb{R}^X$ is a family of classifiers $f : X \to Y$. $\mathcal{P}$ and $\mathcal{Q}$ are respectively the prior and the posterior distributions over $\mathcal{F}$. Given the prior distribution $\mathcal{P}$ and the training set $S$, the learning process consists in finding the posterior distribution $\mathcal{Q}$ leading to good generalization.

Since we make use of the prior distribution $\mathcal{P}$ on $\mathcal{F}$, a PAC-Bayes generalization bound depends on the Kullback-Leibler divergence (KL-divergence):

$$KL(\mathcal{Q}\|\mathcal{P}) = \mathbb{E}_{f \sim \mathcal{Q}} \log \frac{\mathcal{Q}(f)}{\mathcal{P}(f)}. \tag{1}$$

The function $\mathrm{sign}(a)$ is equal to $+1$ if $a \geq 0$ and $-1$ otherwise. The indicator function $I(a)$ is equal to $1$ if $a$ is true and $0$ otherwise.
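To make the KL-divergence of Equation (1) concrete, the following is a minimal sketch for the special case of a finite family of classifiers, where prior and posterior are plain probability vectors. The function name `kl_divergence` and the example weights are illustrative assumptions, not part of the original report.

```python
import math

def kl_divergence(posterior, prior):
    """KL(Q || P) = sum_f Q(f) * log(Q(f) / P(f)) for finite distributions.

    Assumption: both arguments are probability vectors indexed by the same
    finite set of classifiers.
    """
    if len(posterior) != len(prior):
        raise ValueError("distributions must have the same support size")
    total = 0.0
    for q, p in zip(posterior, prior):
        if q == 0.0:
            continue  # usual convention: 0 * log(0 / p) = 0
        if p == 0.0:
            return math.inf  # Q puts mass where P has none
        total += q * math.log(q / p)
    return total

# Hypothetical example: a uniform prior over four classifiers and a
# posterior concentrated on the first one.
uniform_prior = [0.25, 0.25, 0.25, 0.25]
peaked_posterior = [0.7, 0.1, 0.1, 0.1]
kl_value = kl_divergence(peaked_posterior, uniform_prior)
```

The divergence is zero when the posterior equals the prior and grows as the posterior concentrates, which is why it acts as a complexity term in the bounds recalled later.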



2.2 Conventions and Basics on Matrices 



Throughout the paper we consider only real-valued square matrices $C$ of order $Q$ (the number of classes). ${}^tC$ is the transpose of the matrix $C$, $\mathrm{Id}_Q$ denotes the identity matrix of size $Q$ and $0$ is the zero matrix.

The results given in this paper are based on a concentration inequality of Tropp (2011) for a sum of random self-adjoint matrices. In the case when a matrix is not self-adjoint and is real-valued, we use the dilation of such a matrix, given in Paulsen (2002), which is defined as follows:



$$\mathcal{S}(C) \stackrel{\text{def}}{=} \begin{pmatrix} 0 & C \\ {}^tC & 0 \end{pmatrix}. \tag{2}$$
The symbol $\|\cdot\|$ corresponds to the operator norm, also called the spectral norm since it returns the largest singular value of its argument, which is defined by:

$$\|C\| = \max\{\lambda_{\max}(C), -\lambda_{\min}(C)\}, \tag{3}$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are respectively the algebraic maximum and minimum singular value of $C$. Note that the dilation preserves spectral information, so we have:

$$\lambda_{\max}(\mathcal{S}(C)) = \|\mathcal{S}(C)\| = \|C\|. \tag{4}$$

An important property of the operator norm is the following:

$$\forall a \in \mathbb{R}, \quad \|aC\| = |a|\,\|C\|. \tag{5}$$
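The dilation and the norm identities (4) and (5) can be checked numerically. The sketch below uses NumPy on a random non-symmetric matrix; the helper name `dilation` and the random test matrix are assumptions made for illustration only.

```python
import numpy as np

def dilation(C):
    """Self-adjoint dilation S(C) = [[0, C], [C^T, 0]] of a real square matrix."""
    C = np.asarray(C, dtype=float)
    zeros = np.zeros_like(C)
    return np.block([[zeros, C], [C.T, zeros]])

rng = np.random.default_rng(0)
C = rng.normal(size=(4, 4))                 # a generic non-symmetric matrix
S = dilation(C)

operator_norm = np.linalg.norm(C, 2)        # largest singular value of C
lambda_max = np.linalg.eigvalsh(S).max()    # largest eigenvalue of S(C)
scaled_norm = np.linalg.norm(-3.0 * C, 2)   # used to illustrate property (5)
```

Here `lambda_max` and `operator_norm` agree up to floating-point error, and `scaled_norm` equals three times `operator_norm`, in line with the homogeneity property.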

3 The Usual PAC-Bayes Theorem 

In this section, we recall the main PAC-Bayesian bound in the binary classification case as presented in McAllester (2003); Seeger and Seeger (2002); Langford (2005). The set of labels we consider is $Y = \{-1, 1\}$ (with $Q = 2$) and, for each classifier $f \in \mathcal{F}$, the predicted output of $x \in X$ is given by $\mathrm{sign}(f(x))$. The true risk $R(f)$ and the empirical error $R_S(f)$ of $f$ are defined as:

$$R(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\, I(f(x) \neq y); \qquad R_S(f) = \frac{1}{m}\sum_{i=1}^{m} I(f(x_i) \neq y_i).$$

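The learner's aim is to choose a posterior distribution $\mathcal{Q}$ on $\mathcal{F}$ such that the risk of the $\mathcal{Q}$-weighted majority vote (also called the Bayes classifier) $B_\mathcal{Q}$ is as small as possible. $B_\mathcal{Q}$ is defined by:

$$B_\mathcal{Q}(x) = \mathrm{sign}\left[\mathbb{E}_{f\sim\mathcal{Q}} f(x)\right].$$

The true risk $R(B_\mathcal{Q})$ and the empirical error $R_S(B_\mathcal{Q})$ of the Bayes classifier are defined as the probability that it commits an error on an example:

$$R(B_\mathcal{Q}) = P_{(x,y)\sim\mathcal{D}}\left(B_\mathcal{Q}(x) \neq y\right). \tag{6}$$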

However, the PAC-Bayes approach does not directly bound the risk of $B_\mathcal{Q}$. Instead, it bounds the risk of the stochastic Gibbs classifier $G_\mathcal{Q}$, which predicts the label of $x \in X$ by first drawing $f$ according to $\mathcal{Q}$ and then returning $f(x)$. The true risk $R(G_\mathcal{Q})$ and the empirical error $R_S(G_\mathcal{Q})$ of $G_\mathcal{Q}$ are therefore:

$$R(G_\mathcal{Q}) = \mathbb{E}_{f\sim\mathcal{Q}} R(f); \qquad R_S(G_\mathcal{Q}) = \mathbb{E}_{f\sim\mathcal{Q}} R_S(f). \tag{7}$$

Note that in this setting, if $B_\mathcal{Q}$ misclassifies $x$, then at least half of the classifiers (under $\mathcal{Q}$) commit an error on $x$. Hence, we directly have: $R(B_\mathcal{Q}) \leq 2R(G_\mathcal{Q})$. Thus, an upper bound on $R(G_\mathcal{Q})$ gives rise to an upper bound on $R(B_\mathcal{Q})$.
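The factor-two relation between the majority vote and the Gibbs classifier also holds on any finite sample, which makes it easy to illustrate. The sketch below builds a hypothetical finite family of binary voters with synthetic outputs (the voters, their 65% agreement rate and the Dirichlet posterior are all assumptions for illustration) and compares the two empirical risks.

```python
import numpy as np

rng = np.random.default_rng(1)
n_voters, n_examples = 7, 200

y = rng.choice([-1, 1], size=n_examples)
# votes[j, i] = f_j(x_i): each hypothetical voter agrees with y about 65% of the time.
agree = rng.random((n_voters, n_examples)) < 0.65
votes = np.where(agree, y, -y)

posterior = rng.dirichlet(np.ones(n_voters))    # weights Q(f_j), summing to 1

voter_errors = (votes != y).mean(axis=1)        # R_S(f_j) for each voter
gibbs_risk = float(posterior @ voter_errors)    # R_S(G_Q) = E_{f~Q} R_S(f)

weighted_vote = posterior @ votes               # E_{f~Q} f(x_i) for each example
majority_pred = np.where(weighted_vote >= 0, 1, -1)   # sign(a) = +1 if a >= 0
bayes_risk = float((majority_pred != y).mean())       # R_S(B_Q)
```

Whenever the weighted vote is wrong on an example, voters carrying at least half of the posterior mass are wrong on it, so `bayes_risk` never exceeds `2 * gibbs_risk`.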

We present the PAC-Bayes theorem which gives a bound on the error of the stochastic Gibbs classifier. 

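Theorem 1 (i.i.d. binary classification PAC-Bayes Bound). For any $\mathcal{D}$, any $\mathcal{F}$, any $\mathcal{P}$ of support $\mathcal{F}$, any $\delta \in (0, 1]$, we have,

$$P_{S\sim\mathcal{D}_m}\left(\forall \mathcal{Q} \text{ on } \mathcal{F},\ kl\left(R_S(G_\mathcal{Q}), R(G_\mathcal{Q})\right) \leq \frac{1}{m}\left[KL(\mathcal{Q}\|\mathcal{P}) + \ln\frac{\xi(m)}{\delta}\right]\right) \geq 1-\delta,$$

where $kl(a, b) \stackrel{\text{def}}{=} a \ln\frac{a}{b} + (1-a)\ln\frac{1-a}{1-b}$, and $\xi(m) \stackrel{\text{def}}{=} \sum_{i=0}^{m}\binom{m}{i}(i/m)^i(1-i/m)^{m-i}$.

We now provide a novel PAC-Bayes bound in the context of multi-class classification by considering the confusion matrix as an error measure.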
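A bound stated on $kl(R_S(G_\mathcal{Q}), R(G_\mathcal{Q}))$ becomes an explicit upper bound on the true Gibbs risk once the binary kl is inverted numerically. The following sketch assumes the right-hand side has the form $\frac{1}{m}[KL(\mathcal{Q}\|\mathcal{P}) + \ln(\xi(m)/\delta)]$ with $\xi(m) = \sum_{i=0}^{m}\binom{m}{i}(i/m)^i(1-i/m)^{m-i}$, and uses bisection; the function names and the numerical inputs are illustrative assumptions.

```python
import math

def binary_kl(a, b):
    """kl(a, b) = a ln(a/b) + (1-a) ln((1-a)/(1-b)), with 0 ln 0 = 0."""
    eps = 1e-12
    b = min(max(b, eps), 1.0 - eps)   # keep the logarithms finite
    value = 0.0
    if a > 0.0:
        value += a * math.log(a / b)
    if a < 1.0:
        value += (1.0 - a) * math.log((1.0 - a) / (1.0 - b))
    return value

def xi(m):
    """xi(m) = sum_{i=0}^m C(m, i) (i/m)^i (1 - i/m)^(m-i).

    Direct summation in floats; only intended for moderate m (a few hundred).
    """
    return sum(math.comb(m, i) * (i / m) ** i * (1 - i / m) ** (m - i)
               for i in range(m + 1))

def gibbs_risk_upper_bound(empirical_risk, kl_qp, m, delta):
    """Largest r >= empirical_risk with kl(empirical_risk, r) <= budget."""
    budget = (kl_qp + math.log(xi(m) / delta)) / m
    low, high = empirical_risk, 1.0
    for _ in range(100):              # bisection on the increasing branch of kl
        mid = 0.5 * (low + high)
        if binary_kl(empirical_risk, mid) <= budget:
            low = mid
        else:
            high = mid
    return low

# Hypothetical values: 10% empirical Gibbs error, KL(Q||P) = 2, m = 500.
bound = gibbs_risk_upper_bound(empirical_risk=0.1, kl_qp=2.0, m=500, delta=0.05)
```

Bisection is valid here because, for a fixed first argument, `binary_kl` is increasing in its second argument on the interval above the empirical risk.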



Technical Report V 2.0 






E. Morvant, S. Koço, L. Ralaivola PAC-Bayesian Generalization Bounds on the Confusion Matrix



4 Multiclass PAC-Bayes Bound 
4.1 Definitions and Setting 

As said earlier, we focus on multi-class classification. The output space is $Y = \{1, \dots, Q\}$, with $Q \geq 2$. We only consider learning algorithms acting on a learning sample $S = \{(x_i, y_i)\}_{i=1}^m$, where each example is drawn i.i.d. according to $\mathcal{D}$, such that $|S| \geq Q$ and $m_{y_j} \geq 1$ for every class $y_j \in Y$, where $m_{y_j}$ is the number of examples of real class $y_j$. In the context of multi-class classification, an error measure can be the confusion matrix. Concretely, for a given classifier $f \in \mathcal{F}$ and a sample $S = \{(x_i, y_i)\}_{i=1}^m \sim \mathcal{D}_m$, the empirical confusion matrix $D_S^f = (\hat{d}_{pq})_{1\leq p,q\leq Q}$ of $f$ is defined as follows:

$$\forall (p,q), \quad \hat{d}_{pq} \stackrel{\text{def}}{=} \sum_{i=1}^{m} \frac{1}{m_{y_i}}\, I(f(x_i) = q)\, I(y_i = p).$$

The true confusion matrix $D^f = (d_{pq})_{1\leq p,q\leq Q}$ of $f$ over $\mathcal{D}$ corresponds to:

$$\forall (p,q), \quad d_{pq} \stackrel{\text{def}}{=} \mathbb{E}_{x|y=p}\, I(f(x) = q) = P_{(x,y)\sim\mathcal{D}}\left(f(x) = q \mid y = p\right).$$

If $f$ correctly classifies every example of the sample $S$, then all the elements of the confusion matrix are $0$, except for the diagonal ones, which correspond to the correctly classified examples. Hence the more non-zero elements there are outside the diagonal of a confusion matrix, the more the classifier is prone to err. Recall that in a learning process the objective is to learn a classifier $f \in \mathcal{F}$ with a low true error (i.e. with good generalization guarantees); we are thus only interested in the errors of $f$. Our objective is then to find $f$ leading to a confusion matrix with as many zero elements as possible outside the diagonal. Therefore, we propose to consider a different kind of confusion matrix by discarding the diagonal values. The only non-zero elements of the new confusion matrix correspond to the examples that are misclassified by $f$. For all $f \in \mathcal{F}$, we define the empirical and true confusion matrices of $f$ by respectively $C_S^f = (\hat{c}_{pq})_{1\leq p,q\leq Q}$ and $C^f = (c_{pq})_{1\leq p,q\leq Q}$ such that:

$$\forall (p,q), \quad \hat{c}_{pq} \stackrel{\text{def}}{=} \begin{cases} 0 & \text{if } q = p \\ \hat{d}_{pq} & \text{otherwise,} \end{cases} \tag{8}$$

$$\forall (p,q), \quad c_{pq} \stackrel{\text{def}}{=} \begin{cases} 0 & \text{if } q = p \\ d_{pq} = P_{(x,y)\sim\mathcal{D}}\left(f(x) = q \mid y = p\right) & \text{otherwise.} \end{cases} \tag{9}$$

Note that if $f$ correctly classifies every example of a given sample $S$, then the empirical confusion matrix $C_S^f$ is equal to $0$. Similarly, if $f$ is a perfect classifier over the distribution $\mathcal{D}$, then the true confusion matrix $C^f$ is equal to $0$. Aiming at controlling the confusion matrix of a classifier is therefore a relevant task. More precisely, one may aim at a confusion matrix that is 'small', where 'small' means as close to $0$ as possible. As we shall see, the size of a confusion matrix will be measured by its operator norm.
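The zero-diagonal empirical confusion matrix and its operator norm are straightforward to compute. The sketch below is one possible implementation under the assumption that labels are encoded as integers $0, \dots, Q-1$ (the paper uses $1, \dots, Q$); the function name and the toy labels are hypothetical.

```python
import numpy as np

def empirical_confusion_matrix(y_true, y_pred, n_classes):
    """Zero-diagonal confusion matrix: entry (p, q), p != q, is the fraction
    of examples of class p that were predicted as class q.

    Assumption: labels are integers in {0, ..., n_classes - 1}.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    class_counts = np.bincount(y_true, minlength=n_classes)   # m_y per class
    if np.any(class_counts == 0):
        raise ValueError("every class needs at least one example")
    C = np.zeros((n_classes, n_classes))
    for p, q in zip(y_true, y_pred):
        if p != q:                      # correct predictions are discarded
            C[p, q] += 1.0 / class_counts[p]
    return C

# Toy sample with three classes (illustrative values only).
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 0, 2, 1, 1, 0, 2, 2, 2]
C_hat = empirical_confusion_matrix(y_true, y_pred, n_classes=3)
size = np.linalg.norm(C_hat, 2)         # operator norm used as the 'size'
```

Each row is normalized by its own class count, so a rare class that is often misclassified weighs as much as a frequent one, unlike the plain misclassification rate.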



4.2 Main Result: PAC-Bayes Bound on the Confusion Matrix of the Gibbs 
Classifier 

Our main result is a PAC-Bayes generalization bound over the Gibbs classifier $G_\mathcal{Q}$ in this particular context, where the empirical and true error measures are respectively given by the confusion matrices from (8) and (9). In this case, we can define the true and the empirical confusion matrices of $G_\mathcal{Q}$ respectively by:

$$C^{G_\mathcal{Q}} = \mathbb{E}_{f\sim\mathcal{Q}}\, \mathbb{E}_{S\sim\mathcal{D}_m} C_S^f; \qquad C_S^{G_\mathcal{Q}} = \mathbb{E}_{f\sim\mathcal{Q}}\, C_S^f.$$

Given $f \sim \mathcal{Q}$ and a sample $S \sim \mathcal{D}_m$, our objective is to bound the difference between $C^{G_\mathcal{Q}}$ and $C_S^{G_\mathcal{Q}}$, the true and empirical errors of the Gibbs classifier. The structure that we will consider in the space of confusion matrices is the one induced by the operator norm (Equation (3)) on the set of matrices. This norm will allow us to formally relate the true and empirical confusion matrices of the Gibbs classifier, and it will also allow us to provide a bound on the size $\|C^{G_\mathcal{Q}}\|$ of the true confusion matrix.
Here is our main result.
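Theorem 2. Let $X \subseteq \mathbb{R}^d$ be the input space, $Y = \{1, \dots, Q\}$ the output space, $\mathcal{D}$ a distribution over $X \times Y$ (with $\mathcal{D}_m$ the distribution of an $m$-sample) and $\mathcal{F}$ a family of classifiers from $X$ to $Y$. Then for every prior distribution $\mathcal{P}$ over $\mathcal{F}$ and any $\delta \in (0, 1]$, we have:

$$P_{S\sim\mathcal{D}_m}\left(\forall \mathcal{Q} \text{ on } \mathcal{F},\ \left\|C_S^{G_\mathcal{Q}} - C^{G_\mathcal{Q}}\right\| \leq \sqrt{\frac{8Q}{m_- - 8Q}\left[KL(\mathcal{Q}\|\mathcal{P}) + \ln\frac{m_-}{4\delta}\right]}\right) \geq 1-\delta,$$

where $m_- = \min_{y=1,\dots,Q} m_y$ corresponds to the minimal number of examples from $S$ which belong to the same class.

Proof. Deferred to Section 5. □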



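Note that, for all $y \in Y$, we need the following hypothesis: $m_y > 8Q$, which is not too strong a limitation. Finally, we rewrite Theorem 2 to have the size of the confusion matrix under consideration.

Corollary 1. We consider the hypotheses of Theorem 2. We have:

$$P_{S\sim\mathcal{D}_m}\left(\forall \mathcal{Q} \text{ on } \mathcal{F},\ \left\|C^{G_\mathcal{Q}}\right\| \leq \left\|C_S^{G_\mathcal{Q}}\right\| + \sqrt{\frac{8Q}{m_- - 8Q}\left[KL(\mathcal{Q}\|\mathcal{P}) + \ln\frac{m_-}{4\delta}\right]}\right) \geq 1-\delta.$$

Proof. By application of the reverse triangle inequality $\big|\|A\| - \|B\|\big| \leq \|A - B\|$ to Theorem 2. □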



For a fixed prior $\mathcal{P}$ on $\mathcal{F}$, both Theorem 2 and Corollary 1 yield a bound on the estimation (through the operator norm) of the true confusion matrix of the Gibbs classifier over all posterior distributions $\mathcal{Q}$ on $\mathcal{F}$, though this is more explicit in the corollary. If the number of classes $Q$ is held constant, then the true risk is upper-bounded by the empirical risk of the Gibbs classifier plus a term depending on the number of training examples, especially on the value $m_-$, which corresponds to the minimal number of examples that belong to the same class. This means that the larger $m_-$, the closer the empirical confusion matrix of the Gibbs classifier is to its true matrix. These bounds use first-order information and vary as $O(1/\sqrt{m_-})$, which is a typical rate of bounds not using second-order information.
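To get a feel for the magnitude of the deviation term, the sketch below evaluates it under the assumption that it has the form $\sqrt{\frac{8Q}{m_- - 8Q}\left[KL(\mathcal{Q}\|\mathcal{P}) + \ln\frac{m_-}{4\delta}\right]}$, which is the expression obtained from the simplification in Section 5.3. The function name and the class counts are hypothetical.

```python
import math

def confusion_bound_term(kl_qp, class_counts, delta):
    """Deviation term sqrt(8Q / (m_- - 8Q) * (KL + ln(m_- / (4 delta)))).

    Assumptions: class_counts lists the number of training examples m_y of
    each of the Q classes, and every m_y exceeds 8Q.
    """
    n_classes = len(class_counts)            # Q
    m_minus = min(class_counts)              # m_-: size of the smallest class
    if m_minus <= 8 * n_classes:
        raise ValueError("the bound requires m_y > 8Q for every class")
    scale = 8.0 * n_classes / (m_minus - 8.0 * n_classes)
    return math.sqrt(scale * (kl_qp + math.log(m_minus / (4.0 * delta))))

# Illustrative values: three classes, KL(Q||P) = 5, confidence 95%.
small_sample = confusion_bound_term(5.0, [1000, 1500, 4000], delta=0.05)
large_sample = confusion_bound_term(5.0, [10000, 15000, 40000], delta=0.05)
```

Only the smallest class count enters the term, so adding examples to already well-represented classes does not tighten it.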



5 Proof of Theorem 2

This section gives the formal proof of Theorem 2. We first introduce a concentration inequality for a sum of random square matrices. This allows us to deduce the PAC-Bayes generalization bound for confusion matrices by following the same "three step process" as the one given in McAllester (2003); Seeger and Seeger (2002); Langford (2005) for the classic PAC-Bayesian bound.



5.1 Concentration Inequality for the Confusion Matrix 

The main result of our work is based on the following corollary of a result on the concentration inequality for a sum of self-adjoint matrices given by Tropp (2011) (see Theorem 3 in Appendix); this theorem generalizes Hoeffding's inequality to the case of self-adjoint random matrices. The purpose of the following corollary is to restate Theorem 3 so that it carries over to matrices that are not self-adjoint. It is central to us to have such a result, as the matrices we are dealing with, namely confusion matrices, are rarely symmetric.

Corollary 2. Consider a finite sequence $\{M_i\}$ of independent, random, square matrices of order $Q$, and let $\{a_i\}$ be a sequence of fixed scalars. Assume that each random matrix satisfies $\mathbb{E}_i M_i = 0$ and $\|M_i\| \leq a_i$ almost surely. Then, for all $\epsilon \geq 0$,

$$P\left\{\left\|\sum_i M_i\right\| \geq \epsilon\right\} \leq 2Q \exp\left(\frac{-\epsilon^2}{8\sigma^2}\right), \tag{10}$$

where $\sigma^2 \stackrel{\text{def}}{=} \sum_i a_i^2$.



¹ This includes any $\mathcal{Q}$ chosen by the learner after observing $S$.
Proof. We want to verify the hypotheses of Theorem 3 in order to apply it.

Let $\{M_i\}$ be a finite sequence of independent, random, square matrices of order $Q$ such that $\mathbb{E}_i M_i = 0$, and let $\{a_i\}$ be a sequence of fixed scalars such that $\|M_i\| \leq a_i$. We consider the sequence $\{\mathcal{S}(M_i)\}$ of random self-adjoint matrices with dimension $2Q$. By the definition of the dilation, we directly obtain $\mathbb{E}_i\, \mathcal{S}(M_i) = 0$.

From Equation (4), the dilation preserves the spectral information. Thus, on the one hand, we have:

$$\left\|\sum_i M_i\right\| = \lambda_{\max}\left(\mathcal{S}\left(\sum_i M_i\right)\right) = \lambda_{\max}\left(\sum_i \mathcal{S}(M_i)\right).$$

On the other hand, we have:

$$\|M_i\| = \|\mathcal{S}(M_i)\| = \lambda_{\max}(\mathcal{S}(M_i)) \leq a_i.$$

To ensure the hypothesis $\mathcal{S}(M_i)^2 \preceq A_i^2$, we need to find a suitable sequence of fixed self-adjoint matrices $\{A_i\}$ of dimension $2Q$ (where $\preceq$ refers to the semidefinite order on self-adjoint matrices). Indeed, it suffices to construct a diagonal matrix defined as $\lambda_{\max}(\mathcal{S}(M_i))\,\mathrm{Id}_{2Q}$ for ensuring $\mathcal{S}(M_i)^2 \preceq \left(\lambda_{\max}(\mathcal{S}(M_i))\,\mathrm{Id}_{2Q}\right)^2$. More precisely, since for every $i$ we have $\lambda_{\max}(\mathcal{S}(M_i)) \leq a_i$, we fix $A_i$ as a diagonal matrix with $a_i$ on the diagonal, i.e. $A_i \stackrel{\text{def}}{=} a_i\,\mathrm{Id}_{2Q}$, with $\left\|\sum_i A_i^2\right\| = \sum_i a_i^2 = \sigma^2$.

Finally, we can invoke Theorem 3 to obtain the concentration inequality (10). □
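The key step of this proof, namely that $A_i = a_i\,\mathrm{Id}_{2Q}$ dominates the dilation in the semidefinite order as soon as $\|M_i\| \leq a_i$, can be sanity-checked numerically. The sketch below does so for one random matrix; the helper `dilation`, the matrix `M` and the scalar `a` are illustrative assumptions.

```python
import numpy as np

def dilation(M):
    """Self-adjoint dilation S(M) = [[0, M], [M^T, 0]] of a real square matrix."""
    M = np.asarray(M, dtype=float)
    zeros = np.zeros_like(M)
    return np.block([[zeros, M], [M.T, zeros]])

rng = np.random.default_rng(2)
order = 5                                    # plays the role of Q
M = rng.uniform(-1.0, 1.0, size=(order, order))
a = 1.1 * np.linalg.norm(M, 2)               # any fixed scalar with ||M|| <= a

S = dilation(M)
A = a * np.eye(2 * order)                    # A = a * Id_{2Q}

# S(M)^2 <= A^2 in the semidefinite order iff A^2 - S(M)^2 has no negative eigenvalue.
gap = A @ A - S @ S
min_eigenvalue = np.linalg.eigvalsh(gap).min()
norm_A_squared = np.linalg.norm(A @ A, 2)    # equals a^2, the contribution to sigma^2
```

The eigenvalues of $\mathcal{S}(M)^2$ are the squared singular values of $M$, so the smallest eigenvalue of the gap is $a^2 - \|M\|^2$, which is non-negative by the choice of `a`.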



In order to make use of this corollary, we rewrite confusion matrices as sums of example-based confusion matrices. That is, for each example $(x_i, y_i) \in S$, we define its empirical confusion matrix by $C_i^f = (\hat{c}_{pq}(i))_{1\leq p,q\leq Q}$ as follows:

$$\forall p, q, \quad \hat{c}_{pq}(i) \stackrel{\text{def}}{=} \begin{cases} 0 & \text{if } q = p \\ \frac{1}{m_{y_i}}\, I(f(x_i) = q)\, I(y_i = p) & \text{otherwise,} \end{cases}$$

where $m_{y_i}$ is the number of examples of class $y_i \in Y$ belonging to $S$. Given an example $(x_i, y_i) \in S$, the example-based confusion matrix contains at most one non-zero element when $f$ misclassifies $(x_i, y_i)$. In the same way, when $f$ correctly classifies $(x_i, y_i)$, the example-based confusion matrix is equal to $0$. Concretely, for every sample $S = \{(x_i, y_i)\}_{i=1}^m$ and every $f \in \mathcal{F}$, our error measure is then $C_S^f = \sum_{i=1}^m C_i^f$.

It naturally appears that we penalize only when $f$ errs.
We further introduce the random square matrices ${C'}_i^f$:

$${C'}_i^f = C_i^f - \mathbb{E}_{S\sim\mathcal{D}_m} C_i^f, \tag{11}$$

which verify $\mathbb{E}_i\, {C'}_i^f = 0$.

We have yet to find a suitable $a_i$ for a given ${C'}_i^f$. Let $\lambda_{\max_i}$ be the maximum singular value of ${C'}_i^f$. It is easy to verify that $\lambda_{\max_i} \leq \frac{1}{m_{y_i}}$. Thus, for all $i$ we fix $a_i$ equal to $\frac{1}{m_{y_i}}$.

Finally, with the introduced notations, Corollary 2 leads to the following concentration inequality:

$$P\left\{\left\|\sum_{i=1}^m {C'}_i^f\right\| \geq \epsilon\right\} \leq 2Q \exp\left(\frac{-\epsilon^2}{8\sigma^2}\right). \tag{12}$$

This inequality (12) allows us to demonstrate our Theorem 2 by following the process of McAllester (2003); Seeger and Seeger (2002); Langford (2005).



5.2 "Three Step Proof" Of Our Bound 

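First, thanks to concentration inequality (12), we prove the following lemma.

Lemma 1. Let $Q$ be the size of $C_S^f$ and ${C'}_S^f = C_S^f - \mathbb{E}_{S\sim\mathcal{D}_m} C_S^f$ defined as in (11). Then the following bound holds for any $\delta \in (0, 1]$:

$$P_{S\sim\mathcal{D}_m}\left(\mathbb{E}_{f\sim\mathcal{P}}\left[\exp\left(\frac{1-8\sigma^2}{8\sigma^2}\left\|{C'}_S^f\right\|^2\right)\right] \leq \frac{2Q}{8\sigma^2\delta}\right) \geq 1-\delta.$$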




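Proof. For readability reasons, we write ${C'}_S = {C'}_S^f$. If $Z$ is a real-valued random variable such that $P(Z \geq z) \leq k \exp(-n\,g(z))$ with $g(z)$ non-negative, non-decreasing and $k$ a constant, then $P\left(\exp\left((n-1)g(Z)\right) \geq \nu\right) \leq \min\left(1, k\nu^{-n/(n-1)}\right)$. We apply this to the concentration inequality (12). Choosing $g(z) = z^2$ (non-negative), $z = \epsilon$, $n = \frac{1}{8\sigma^2}$ and $k = 2Q$, we obtain the following result:

$$P\left(\exp\left(\frac{1-8\sigma^2}{8\sigma^2}\|{C'}_S\|^2\right) \geq \nu\right) \leq \min\left(1, 2Q\,\nu^{-1/(1-8\sigma^2)}\right).$$

Note that $\exp\left(\frac{1-8\sigma^2}{8\sigma^2}\|{C'}_S\|^2\right)$ is always non-negative. Hence it allows us to compute its expectation as:

$$\mathbb{E}\left[\exp\left(\frac{1-8\sigma^2}{8\sigma^2}\|{C'}_S\|^2\right)\right] = \int_0^\infty P\left(\exp\left(\frac{1-8\sigma^2}{8\sigma^2}\|{C'}_S\|^2\right) \geq \nu\right) d\nu.$$

For a given classifier $f \in \mathcal{F}$, we have:

$$\mathbb{E}_{S\sim\mathcal{D}_m}\left[\exp\left(\frac{1-8\sigma^2}{8\sigma^2}\|{C'}_S\|^2\right)\right] \leq 2Q + \int_1^\infty 2Q\,\nu^{-1/(1-8\sigma^2)}\, d\nu = 2Q - 2Q\,\frac{1-8\sigma^2}{8\sigma^2}\left[\nu^{-8\sigma^2/(1-8\sigma^2)}\right]_1^\infty = 2Q + 2Q\,\frac{1-8\sigma^2}{8\sigma^2} = \frac{2Q}{8\sigma^2}. \tag{13}$$

Then, if $\mathcal{P}$ is a probability distribution over $\mathcal{F}$, Equation (13) implies that:

$$\mathbb{E}_{S\sim\mathcal{D}_m}\left[\mathbb{E}_{f\sim\mathcal{P}} \exp\left(\frac{1-8\sigma^2}{8\sigma^2}\|{C'}_S\|^2\right)\right] \leq \frac{2Q}{8\sigma^2}. \tag{14}$$

Using Markov's inequality², we obtain the result of the lemma. □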



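The second step to prove Theorem 2 is to use the shift given in McAllester (2003). We recall this result in the following lemma.

Lemma 2 (McAllester (2003)). Given the Kullback-Leibler divergence³ $KL(\mathcal{Q}\|\mathcal{P})$ between two distributions $\mathcal{P}$ and $\mathcal{Q}$, and letting $g(\cdot)$ be a function, we have:

$$\mathbb{E}_{b\sim\mathcal{Q}}\, g(b) \leq KL(\mathcal{Q}\|\mathcal{P}) + \ln \mathbb{E}_{b\sim\mathcal{P}} \exp(g(b)).$$

Proof. See McAllester (2003). □

Recall that ${C'}_S^f = \sum_{i=1}^m {C'}_i^f$. With $g(b) = \frac{1-8\sigma^2}{8\sigma^2} b^2$ and $b = \|{C'}_S^f\|$, Lemma 2 implies:

$$\mathbb{E}_{f\sim\mathcal{Q}}\left[\frac{1-8\sigma^2}{8\sigma^2}\left\|{C'}_S^f\right\|^2\right] \leq KL(\mathcal{Q}\|\mathcal{P}) + \ln \mathbb{E}_{f\sim\mathcal{P}}\left[\exp\left(\frac{1-8\sigma^2}{8\sigma^2}\left\|{C'}_S^f\right\|^2\right)\right]. \tag{15}$$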



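The last step that completes the proof of Theorem 2 consists in applying the result we obtained in Lemma 1 to Equation (15). Then, we have:

$$\mathbb{E}_{f\sim\mathcal{Q}}\left[\frac{1-8\sigma^2}{8\sigma^2}\left\|{C'}_S^f\right\|^2\right] \leq KL(\mathcal{Q}\|\mathcal{P}) + \ln\frac{2Q}{8\sigma^2\delta}. \tag{16}$$

Since $g(\cdot)$ is clearly convex, we apply Jensen's inequality⁴ to (16). Then, with probability at least $1-\delta$ over $S$, and for every distribution $\mathcal{Q}$ on $\mathcal{F}$, we have:

$$\left(\mathbb{E}_{f\sim\mathcal{Q}}\left[\left\|{C'}_S^f\right\|\right]\right)^2 \leq \frac{8\sigma^2}{1-8\sigma^2}\left[KL(\mathcal{Q}\|\mathcal{P}) + \ln\frac{2Q}{8\sigma^2\delta}\right]. \tag{17}$$

Since ${C'}_S^f = \sum_{i=1}^m {C'}_i^f$, the bound (17) is quite similar to the one given in Theorem 2.

In the next section, we present the calculations leading to our PAC-Bayesian generalization bound.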



² See Theorem 4 in the Appendix.

³ The KL-divergence is defined in Equation (1).

⁴ See Theorem 5 in the Appendix.
5.3 Simplification

We first compute the variance parameter $\sigma^2 = \sum_{i=1}^m a_i^2$. For that purpose, in Section 5.1 we showed that for each $i \in \{1, \dots, m\}$, we can choose $a_i = \frac{1}{m_{y_i}}$, where $y_i$ is the class of the $i$-th example and $m_{y_i}$ is the number of examples of class $y_i$. Thus we have:

$$\sigma^2 = \sum_{i=1}^{m} \frac{1}{m_{y_i}^2} = \sum_{y=1}^{Q} \sum_{i : y_i = y} \frac{1}{m_y^2} = \sum_{y=1}^{Q} \frac{1}{m_y}.$$

For the sake of simplifying Equation (17), and since the term on the right side of this equation is an increasing function with respect to $\sigma^2$, we propose to upper-bound $\sigma^2$:

$$\sigma^2 = \sum_{y=1}^{Q} \frac{1}{m_y} \leq \frac{Q}{\min_{y=1,\dots,Q} m_y}. \tag{18}$$
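The computation of the variance parameter and its upper bound (18) can be reproduced on a toy label vector. The sketch below assumes $a_i = 1/m_{y_i}$ as chosen in Section 5.1; the function name and the class sizes are hypothetical.

```python
from collections import Counter

def variance_parameter(labels):
    """Return (sum_i 1/m_{y_i}^2, sum_y 1/m_y, Q / m_-) for a list of labels.

    Assumption: a_i = 1 / m_{y_i}, where m_{y_i} counts the examples that
    share the class of example i.
    """
    counts = Counter(labels)                                  # m_y for each class
    per_example = sum(1.0 / counts[y] ** 2 for y in labels)   # sum over the m examples
    per_class = sum(1.0 / m_y for m_y in counts.values())     # sum over the Q classes
    upper_bound = len(counts) / min(counts.values())          # Q / m_-
    return per_example, per_class, upper_bound

# Toy sample: three classes with 40, 25 and 100 examples.
labels = [1] * 40 + [2] * 25 + [3] * 100
sigma2_examples, sigma2_classes, sigma2_upper = variance_parameter(labels)
```

On this sample the two expressions of $\sigma^2$ coincide (both equal 0.075), and the upper bound is $3/25 = 0.12$; the gap between the two grows with class imbalance.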



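Let $m_- \stackrel{\text{def}}{=} \min_{y=1,\dots,Q} m_y$; then, using Equation (18), we obtain the following bound from Equation (17):

$$\left(\mathbb{E}_{f\sim\mathcal{Q}}\left[\left\|{C'}_S^f\right\|\right]\right)^2 \leq \frac{8Q}{m_- - 8Q}\left[KL(\mathcal{Q}\|\mathcal{P}) + \ln\frac{m_-}{4\delta}\right].$$

Then:

$$\mathbb{E}_{f\sim\mathcal{Q}}\left[\left\|{C'}_S^f\right\|\right] \leq \sqrt{\frac{8Q}{m_- - 8Q}\left[KL(\mathcal{Q}\|\mathcal{P}) + \ln\frac{m_-}{4\delta}\right]}. \tag{19}$$

It remains to replace ${C'}_S^f = \sum_{i=1}^m \left[C_i^f - \mathbb{E}_{S\sim\mathcal{D}_m} C_i^f\right]$. Recall that $C^{G_\mathcal{Q}} = \mathbb{E}_{f\sim\mathcal{Q}}\mathbb{E}_{S\sim\mathcal{D}_m} C_S^f$ and $C_S^{G_\mathcal{Q}} = \mathbb{E}_{f\sim\mathcal{Q}} C_S^f$; we obtain:

$$\begin{aligned}
\mathbb{E}_{f\sim\mathcal{Q}}\left[\left\|{C'}_S^f\right\|\right] &= \mathbb{E}_{f\sim\mathcal{Q}}\left[\left\|\sum_{i=1}^m \left[C_i^f - \mathbb{E}_{S\sim\mathcal{D}_m} C_i^f\right]\right\|\right] \\
&= \mathbb{E}_{f\sim\mathcal{Q}}\left[\left\|\sum_{i=1}^m C_i^f - \mathbb{E}_{S\sim\mathcal{D}_m}\sum_{i=1}^m C_i^f\right\|\right] \\
&= \mathbb{E}_{f\sim\mathcal{Q}}\left[\left\|C_S^f - \mathbb{E}_{S\sim\mathcal{D}_m} C_S^f\right\|\right] \\
&\geq \left\|\mathbb{E}_{f\sim\mathcal{Q}}\left[C_S^f - \mathbb{E}_{S\sim\mathcal{D}_m} C_S^f\right]\right\| \\
&= \left\|\mathbb{E}_{f\sim\mathcal{Q}} C_S^f - \mathbb{E}_{f\sim\mathcal{Q}}\mathbb{E}_{S\sim\mathcal{D}_m} C_S^f\right\| \\
&= \left\|C_S^{G_\mathcal{Q}} - C^{G_\mathcal{Q}}\right\|.
\end{aligned} \tag{20}$$

By substituting the left part of the inequality (19) with the term (20), we find the bound of our Theorem 2.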

6 Discussion and Future Work 

This work gives rise to many interesting questions, among which the following ones. 

In the case of the classical binary PAC-Bayes framework, it is easy to show that the true error of the Bayes classifier $B_\mathcal{Q}$ and that of the Gibbs classifier $G_\mathcal{Q}$ are related by the following inequality:

$$R(B_\mathcal{Q}) \leq 2R(G_\mathcal{Q}).$$

One may notice that we do not immediately have a similar result for our confusion matrix setting. This question is out of the scope of the present paper, and the proof of such a relation between the confusion matrix-based errors of the Bayes classifier and of the Gibbs classifier for this framework is left for future work.

Other perspectives will focus on instantiating our bound given in Theorem 2 for specific multi-class frameworks, such as multi-class SVM (Weston and Watkins (1998); Crammer and Singer (2002); Lee et al. (2004)) and multi-class boosting (AdaBoost.MH/AdaBoost.MR, Schapire and Singer (2000); SAMME, Zhu et al. (2009); AdaBoost.MM, Mukherjee and Schapire (2011)). Taking advantage of our theorem while using the confusion matrices may allow us to derive new generalization bounds for these methods.

Additionally, we are interested in seeing how effective learning methods may be derived from the risk bound we propose. For instance, in the binary PAC-Bayes setting, the algorithm MinCq proposed by Laviolette et al. (2011) minimizes a bound depending on the first two moments of the margin of the $\mathcal{Q}$-weighted majority vote. From our Theorem 2 and with a similar study, we would like to design a new multi-class learning algorithm and observe how sound such an algorithm could be. This would probably require the derivation of a Cantelli-Tchebycheff deviation inequality in the matrix case.

Besides, it might be very interesting to see how the noncommutative/matrix concentration inequalities provided by Tropp (2011) might be of some use for other kinds of learning problems such as multi-label classification, label ranking problems or structured prediction issues.

Finally, the question of extending the present work to the analysis of algorithms learning (possibly infinite-dimensional) operators, as in Abernethy et al. (2009), is also very exciting.



7 Conclusion 

In this paper, we propose a new PAC-Bayesian generalization bound that applies in the multi-class classification setting. The originality of our contribution is that we consider the confusion matrix as an error measure. Coupled with the use of the operator norm on matrices, we are capable of providing a generalization bound on the 'size' of the confusion matrix (with the idea that the smaller the norm of the confusion matrix of the learned classifier, the better it is for the classification task at hand). The derivation of our result takes advantage of the concentration inequality proposed by Tropp (2011) for the sum of random self-adjoint matrices, which we directly adapt to square matrices that are not self-adjoint.

The main results are presented in Theorem 2 and Corollary 1. The bound in Theorem 2 is given on the difference between the true risk of the Gibbs classifier and its empirical error, while the one given in Corollary 1 upper-bounds the risk of the Gibbs classifier by its empirical error.

An interesting point is that our bound depends on the minimal quantity $m_-$ of training examples belonging to the same class, for a given number of classes. If this value increases, i.e. if we have a lot of training examples, then the empirical confusion matrix of the Gibbs classifier tends to be close to its true confusion matrix. A point worth noting is that the bound varies as $O(1/\sqrt{m_-})$, which is a typical rate in bounds not using second-order information.

The present work gives rise to a few algorithmic and theoretical questions that we have discussed in 
the previous section. 

Appendix 



Theorem 3 (Concentration Inequality for Random Matrices, Tropp (2011)). Consider a finite sequence $\{M_i\}$ of independent, random, self-adjoint matrices with dimension $Q$, and let $\{A_i\}$ be a sequence of fixed self-adjoint matrices. Assume that each random matrix satisfies $\mathbb{E} M_i = 0$ and $M_i^2 \preceq A_i^2$ almost surely. Then, for all $\epsilon \geq 0$,

$$P\left\{\lambda_{\max}\left(\sum_i M_i\right) \geq \epsilon\right\} \leq Q \exp\left(\frac{-\epsilon^2}{8\sigma^2}\right),$$

where $\sigma^2 \stackrel{\text{def}}{=} \left\|\sum_i A_i^2\right\|$ and $\preceq$ refers to the semidefinite order on self-adjoint matrices.

Theorem 4 (Markov's inequality). Let $Z$ be a random variable and $z > 0$, then:

$$P(|Z| \geq z) \leq \frac{\mathbb{E}(|Z|)}{z}.$$

Theorem 5 (Jensen's inequality). Let $X$ be an integrable real-valued random variable and $g(\cdot)$ be a convex function, then:

$$g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)].$$